Automated brain tumor identification using magnetic resonance imaging: A systematic review and meta-analysis

Abstract

Background
Automated brain tumor identification facilitates diagnosis and treatment planning. We evaluate the performance of traditional machine learning (TML) and deep learning (DL) in brain tumor detection and segmentation, using MRI.

Methods
A systematic literature search from January 2000 to May 8, 2021 was conducted. Study quality was assessed using the Checklist for Artificial Intelligence in Medical Imaging (CLAIM). Detection meta-analysis was performed using a unified hierarchical model. Segmentation studies were evaluated using a random effects model. Sensitivity analysis was performed for externally validated studies.

Results
Of 224 studies included in the systematic review, 46 segmentation and 38 detection studies were eligible for meta-analysis. In detection, DL achieved a lower false positive rate compared to TML; 0.018 (95% CI, 0.011 to 0.028) and 0.048 (0.032 to 0.072) (P < .001), respectively. In segmentation, DL had a higher dice similarity coefficient (DSC), particularly for tumor core (TC); 0.80 (0.77 to 0.83) and 0.63 (0.56 to 0.71) (P < .001), persisting on sensitivity analysis. Both manual and automated whole tumor (WT) segmentation had “good” (DSC ≥ 0.70) performance. Manual TC segmentation was superior to automated; 0.78 (0.69 to 0.86) and 0.64 (0.53 to 0.74) (P = .014), respectively. Only 30% of studies reported external validation.

Conclusions
The comparable performance of automated to manual WT segmentation supports its integration into clinical practice. However, manual outperformance for sub-compartmental segmentation highlights the need for further development of automated methods in this area. Compared to TML, DL provided superior performance for detection and sub-compartmental segmentation. Improvements in the quality and design of studies, including external validation, are required for the interpretability and generalizability of automated models.

Brain tumors present a significant burden on healthcare worldwide due to the neurological deficits produced and subsequent poor prognosis, with an average 5-year survival of 35% in malignant subtypes. 1 MRI is the gold standard modality for brain tumor diagnosis and subsequently informs surgical intervention, radiotherapy planning, and chemotherapy. However, qualitative MRI assessment is subject to high inter-rater variability and is a notoriously laborious process. 2 The emergence of Artificial Intelligence (AI) has sparked the hope of overcoming these limitations.
The advent of Computer-Aided Diagnosis (CAD) using AI can potentially improve brain tumor patient outcomes. Traditional machine learning (TML) techniques have become widely used for image classification but are restricted by a requirement for specifying "feature vectors" for extraction from the raw data. 3 Conversely, deep learning (DL) techniques provide effective and automatic representation of complex image features, which has contributed to their increased popularity, 3 but the interpretation of automatically identified features remains a problem. 4 In addition, both TML and DL techniques are vulnerable to overfitting and selection bias. 5 Therefore, to safely use CAD in clinical settings, large robust studies which evaluate their quality and generalizability are crucial. 4 Holistic and standardized evaluation of scientific reporting is facilitated by established guidelines, such as the recently proposed Checklist for Artificial Intelligence in Medical Imaging (CLAIM). 6 The research on AI in neuro-oncology imaging has been amplified by the introduction of open access image datasets, such as the annual Multimodal Brain Tumor Segmentation Challenge (BRATS). 7 This provides the ideal foundation for an in-depth review to identify optimal automated methods. Three former systematic reviews and meta-analyses evaluated performance of AI-related techniques in neuro-oncological imaging. [8][9][10] However, these focused on specific brain tumor types and whole tumor (WT) segmentation, and none have evaluated subcompartmental segmentation nor addressed performance disparities between CAD and human expert segmentation. Moreover, there remains a paucity in comprehensively assessing the quality of studies in this field. We present the largest systematic review and meta-analysis that objectively evaluates performance of automated detection and segmentation techniques and assesses the reporting quality of included studies.

Search Strategy
This systematic review and meta-analysis was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement 11 (PROSPERO; CRD42021247925). We searched PubMed, Web of Science, and Scopus for studies published between January 1, 2000, and May 8, 2021. The search was initially performed on June 19, 2020 and updated on May 8, 2021. The search strategy is found in the Supplementary Appendix. The search was limited to publications written in English. The citations of included articles were hand-searched to identify additional appropriate articles.

Inclusion and Exclusion Criteria
Studies were included if they developed or validated a semi-automatic or fully automatic adult brain tumor detection or segmentation method using MRI. Exclusion criteria: (1) studies reporting tumor classification or tumor grading methods only; (2) studies utilizing MRI spectroscopy only for method development; (3) studies reporting methods on pediatric, pituitary, and/or brainstem tumors only; (4) abstracts or conference proceedings; and (5) no performance metrics reported.

Study Selection and Data Extraction
Extracted citations were imported into the Rayyan systematic review site (https://www.rayyan.ai) for study selection. Following removal of duplicates, titles and abstracts were screened, and full texts of relevant publications reviewed. Study screening was completed by two independent reviewers (O.K., J.D.S.), with disagreements resolved through a consensus-based approach with the wider group.

Importance of the Study
Despite the increasing research on artificial intelligence techniques in medical imaging, their safe implementation into clinical practice depends on rigorous and generalizable evidence. This study systematically evaluated the performance of automated brain tumor detection and segmentation methods, and assessed the quality of reporting using the Checklist for Artificial Intelligence in Medical Imaging guideline. Although automated and manual methods in whole tumor segmentation performed comparably, manual methods performed better in sub-compartmental segmentation. Within automated methods, deep learning was found to be superior to traditional machine learning in detection and sub-compartmental segmentation, but explaining this was hindered by the paucity in reported methods of model interpretability. Less than a third of studies reported external validation of their automated method. The variability found in study reporting undermines the credibility of automated methods, impacting their benefit for patients and health systems. Hence, there is a need for adherence to international reporting standards and guidelines.

Neuro-Oncology Advances

Reporting and Quality Evaluation
The reporting quality of studies was assessed according to CLAIM. 6 The risk of bias and applicability was assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) guideline, 12 with consideration of some CLAIM items (see Supplementary Appendix). Three reviewers (O.K., A.H., D.B.) independently appraised included studies, with any disagreements resolved through consensus. A domain was deemed "good" if it was reported in ≥70% of studies.

Definitions
DL referred to studies that utilized deep neural networks as their method of choice. TML referred to methods not classified as DL. Detection studies were defined as those that reported performance results for techniques that identified the presence of a tumor in an image. Segmentation studies were defined as those that reported performance results for techniques that segmented brain tumors, whether WT, tumor core (TC), and/or enhancing tumor (ET) segmentations as defined by BRATS. 7 Following previous work, a dice similarity coefficient (DSC) of ≥0.7 was considered to represent "good" overlap. 13
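As an illustration of the DSC threshold used here, the following minimal sketch computes the coefficient for two binary masks as twice the overlap divided by the total number of labeled voxels. It is not taken from any included study: the function name `dice_similarity`, the flat-list mask representation, and the convention of returning 1.0 for two empty masks are assumptions made for the example.

```python
def dice_similarity(pred, truth):
    """Dice similarity coefficient between two binary masks.

    Masks are flat sequences of 0/1 voxel labels (a hypothetical
    representation chosen to keep the sketch dependency-free).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    pred_sum = sum(1 for p in pred if p)
    truth_sum = sum(1 for t in truth if t)
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    if pred_sum + truth_sum == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * overlap / (pred_sum + truth_sum)


# Toy example: an automated mask compared with a manual reference mask.
automated = [0, 1, 1, 1, 0, 0, 1, 0]
manual = [0, 1, 1, 0, 0, 1, 1, 0]
dsc = dice_similarity(automated, manual)
is_good = dsc >= 0.7  # "good" overlap threshold used in this review
```

In practice the masks would be three-dimensional label volumes for WT, TC, or ET; flattening them to one dimension does not change the coefficient.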

Statistical Analysis
A meta-analysis was conducted for both automated detection and segmentation studies to compare DL with TML methods and to compare the segmentation performance of CAD with that of human experts. Studies providing performance metrics for their method on different datasets were assumed to be independent of each other, because we were interested in providing an overview of the two methods rather than exact point estimates.
For detection methods, contingency tables consisting of True Positive, False Positive, False Negative, and True Negative were constructed. For studies that did not directly provide contingency tables, missing data were calculated with Review Manager 5.3 (https://revman.cochrane.org/) using sensitivity, specificity, and number of images. If neither contingency tables nor sufficient data were reported for computation, then the study was excluded from meta-analysis. A unified hierarchical summary receiver operating characteristic model was developed for the detection meta-analysis. Summary estimates of sensitivity and specificity with 95% CIs were derived using the random-effects bivariate binomial model parameters and equivalence equations of Harbord et al. 14 The reason for using the hierarchical model is that it considers the correlation between sensitivity and specificity, accounting for within-study variability, as well as variability (also called heterogeneity) in effects between studies (ie, between-study variability). Receiver operating characteristic (ROC) curves were used to plot summary estimates of sensitivity against false positive rate (FPR, ie, 1-specificity). The ROC curve plots also exhibit the uncertainty around the summary estimates via 95% confidence regions, and heterogeneity between accuracy estimates via 95% prediction regions.
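The back-calculation of a contingency table from reported rates can be sketched as follows. This is an illustrative approximation only and does not reproduce Review Manager's internal procedure; the function names and the assumption that the numbers of tumor and non-tumor images are known separately are ours.

```python
def contingency_from_rates(sensitivity, specificity, n_tumor, n_no_tumor):
    """Rebuild a 2x2 table from sensitivity, specificity, and image counts.

    Assumption: the counts of tumor and non-tumor images are reported
    separately; counts are rounded to the nearest integer.
    """
    tp = round(sensitivity * n_tumor)
    fn = n_tumor - tp
    tn = round(specificity * n_no_tumor)
    fp = n_no_tumor - tn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def false_positive_rate(table):
    """FPR = FP / (FP + TN), ie, 1 - specificity."""
    return table["FP"] / (table["FP"] + table["TN"])


# Hypothetical study: 200 tumor images and 100 non-tumor images.
table = contingency_from_rates(0.95, 0.98, n_tumor=200, n_no_tumor=100)
fpr = false_positive_rate(table)
```

Rounding means the reconstructed table may differ by one image from the original counts when the reported rates were themselves rounded.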
Segmentation methods were evaluated using a random effects model, and reported in terms of pooled DSC, a universally used and reported metric. The restricted maximum likelihood estimator was used to calculate the heterogeneity variance (τ²). The inverse variance method was used to calculate a pooled effect size. Knapp-Hartung adjustments were used to calculate the confidence interval. A prerequisite for study inclusion in the meta-analysis was reporting the outcome of interest (ie, DSC), in combination with an SD. Subgroup analysis comparing tumor types was performed where possible. A comparative analysis was conducted to evaluate the performance of CAD versus human experts. Sensitivity analysis was performed looking at studies that only performed out-of-sample external validation. Subgroup or sensitivity analysis was avoided when the number of studies in a group was small (n < 5). Study heterogeneity was formally evaluated using Higgins' inconsistency index (I²) (I² > 50% = significant heterogeneity). All analyses were performed in R (version 4.0.2, http://www.r-project.org/) using the tidyverse, metaDTA, dmetar, meta, and ComplexUpset packages.
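To make the pooling steps concrete, the sketch below implements inverse-variance random-effects pooling of study-level mean DSC values together with Higgins' I². It is a simplified stand-in, not the analysis code used in this review: it substitutes the DerSimonian-Laird estimator for restricted maximum likelihood and a normal-approximation interval for the Knapp-Hartung adjustment, and it assumes each study's standard error can be taken as SD divided by the square root of its number of cases (`ns`, a hypothetical input).

```python
import math


def pool_random_effects(means, sds, ns):
    """Inverse-variance random-effects pooling (DerSimonian-Laird tau^2)."""
    # Within-study variance of each mean (assumption: SE = SD / sqrt(n)).
    variances = [sd ** 2 / n for sd, n in zip(sds, ns)]
    w = [1.0 / v for v in variances]

    # Fixed-effect estimate, needed for Cochran's Q.
    fixed = sum(wi * m for wi, m in zip(w, means)) / sum(w)
    q = sum(wi * (m - fixed) ** 2 for wi, m in zip(w, means))
    df = len(means) - 1

    # DerSimonian-Laird between-study variance, truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * m for wi, m in zip(w_re, means)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    # Higgins' I^2 as a percentage, truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se,   # normal approximation,
        "ci_high": pooled + 1.96 * se,  # not Knapp-Hartung
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical study-level DSC summaries (not data from this review).
result = pool_random_effects(
    means=[0.6, 0.8, 0.9], sds=[0.05, 0.05, 0.05], ns=[25, 25, 25]
)
significant_heterogeneity = result["i2"] > 50.0
```

Because the Knapp-Hartung adjustment typically widens the interval when few studies are pooled, the interval returned by this sketch should be read as narrower than one produced by the method described above.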

Results
Our search identified 2367 records, of which 1515 records were screened (Figure 1). An additional 22 texts were identified through cross-referencing. Two hundred and sixty-two full texts were assessed for eligibility and 224 were included in the systematic review: 188 segmentation and 46 detection studies (10 studies reported both detection and segmentation results; see "Eligible Studies" in Supplementary Appendix). Forty-six segmentation 15 and 38 detection studies were eligible for meta-analysis ([…] Figure 1). Most studies utilized a fully automated algorithm (n = 222; 94.9%). 80.7% (n = 189) used data from open-access repositories, with BRATS being the most popular of them (n = 156; 66.7%). 29.0% (n = 68) used local datasets, all of which were retrospectively collected data. 11.9% (n = 28) used both local and public datasets. 2.1% (n = 5) did not specify the dataset(s) used (Supplementary Figure 2). Publicly available datasets are detailed in Supplementary Table 3.

Reporting Quality
Detailed CLAIM assessment is presented in Supplementary
However, only 1.3% (n = 3) of studies clarified missing data handling. No studies reported sample size calculations (CLAIM item 19). Less than two-thirds (n = 144, 61.5%) specified how data were partitioned (CLAIM item 20). Only 32.5% (n = 76) of studies reported uncertainty around performance metrics (CLAIM item 29). 67.1% (n = 157) of studies reported performing internal and/or external validation (CLAIM item 32). Just 2.6% (n = 6) specified the inclusion and exclusion flow of participants or images (CLAIM item 33) and only 6% (n = 14) defined demographics and clinical characteristics of cases in each partition (CLAIM item 34). Ten studies made the algorithm source code publicly available (CLAIM item 41; for available links to source codes see Supplementary Table 8).
Table 9 (segmentation) and Supplementary Table 10 (detection). In the patient selection domain of risk of bias, 21.4% (n = 50) of studies were considered to have unclear or high risk of bias as they did not state the exclusion criteria for the utilized dataset(s). In the reference standard domain, 13.2% (n = 31) were deemed to have unclear or high risk of bias as they did not clearly define how the ground truth segmentation was derived. In terms of applicability, the main source of concern was in the index test domain; 31.6% (n = 74) had high applicability concerns as they did not validate the algorithm (Supplementary Figure 10).
Segmentation meta-analysis: Due to limited numbers of semi-automated studies, segmentation meta-analysis solely focused on fully automated methods. Forty-six fully automated segmentation studies provided sufficient data to be included in the meta-analysis. 34 Table 13).
Since few studies applied their segmentation techniques to meningiomas and nerve sheath tumors, they could not be included in subgroup analyses. The subgroup analysis thus compared HGG, LGG, and metastatic brain tumors. Only WT segmentation results for metastatic brain tumors were possible to compute due to limited studies. ET segmentation was predominantly performed on HGG, thereby excluding it from subgroup analysis. It was not possible to compare DL and TML methods in diagnosing different types of tumors due to the small number of studies.
For WT segmentation, no difference was observed between HGG, LGG, and metastatic tumors, 0.83
Automated versus human expert segmentation: Only 30.4% (n = 14/46) of studies provided sufficient data for comparison between automated and expert manual segmentation for WT and TC segmentation. All studies included multiple (>1) independent expert operators for generating ground truth segmentations; one study (7.1%) Table 14).

Discussion
To date, this is the largest meta-analysis evaluating automated brain tumor segmentation and detection methods. Automation provides benefits including elimination of human inter-rater variability and reduced inference time 2 ; particularly DL methods, which showed an impressive median inference time of 0.2 seconds/MRI slice.
Previous studies have concluded that, in general, automated methods are comparable to human expertise in terms of performance. 10,96 However, our research highlights that this only holds true for WT segmentation in brain tumors. Notably, we found that manual methods outperformed automated techniques for TC segmentation. Sub-compartmental segmentation, including TC, is a major influence on tumor progression monitoring and radiotherapy planning. 97 Hence, our finding cautions against applying machine learning to all of its potential uses in routine clinical practice and highlights the need for further research on sub-compartmental automated segmentation (TC and ET). Since most methods used conventional MRI scans (ie, T1, T2, T1CE, and FLAIR), future studies could combine these multimodal sequences with other specialized MRI sequences to increase the number of features and assess whether segmentation results improve. Soltaninejad et al. 30 and Durmo et al. 98 incorporated features obtained from diffusion-weighted and diffusion tensor imaging and showed promising results in the automated identification of brain tumors. Including other MRI sequences in publicly available datasets, such as BRATS, could facilitate investigations into the diagnostic value of additional features.
Regarding automated detection, we have replicated the findings of Cho et al.'s systematic review on brain tumor metastasis 8 ; DL had a significantly lower FPR than TML, whilst sensitivity between the two methods remained similar. To the best of our knowledge, there has been no previous evaluation of automated sub-compartmental segmentation of brain tumors. Our study extends confidence in DL to tumor segmentation; the DL group achieved
"good" (DSC ≥ 0.7) performance for all segmentation types (WT, TC, ET), whereas for TML, "good" performance was limited to WT segmentation. This trend persisted with sensitivity analysis investigating only externally validated studies, reinforcing these results. DL techniques support the automatic identification of complex features unlike TML, which requires hand-crafted feature vectors. 3 However, the advantages of DL remain ambiguous, due to its "black box" nature; the interpretability of learned features and the explainability of the model's decisions could be improved. 3,4 Certain methods, such as saliency maps or feature attribution attempt to deduce how these learning algorithms detect complex features. 99 However, just 2.1% (n = 5) of studies reported such methods, hindering model interpretation. This highlights the importance of future work reporting DL interpretation to improve comprehension and transparency of algorithmic predictions.
Van Kempen et al. 9 reported good performance of machine learning algorithms for glioma WT segmentation, also showing that automated segmentation for both HGG and LGG were comparable. Our subgroup analysis, stratified by tumor type, showed "good" performance, and no statistically significant difference between tumor types for WT segmentation. However, this was not consistent for TC segmentation; both HGG and LGG tumors did not reach "good" performance as was evident for WT. This is clinically pertinent, because of the aforementioned value of reliable automated sub-compartmental segmentation in treatment pathways. HGG TC segmentation performance was found to be significantly better than LGG. This may be due to LGG's slow growth, lack of surrounding vasogenic edema, and poor enhancement on MRI, making LGGs radiologically more difficult to identify. 98 Moreover, HGGs are highly proliferative tumors resulting in higher lesion contrast and enhancement, making them radiologically more noticeable. 98 This study shows that although manual WT segmentation statistically outperformed automated segmentation for HGG, both achieved "good" performance (DSC ≥ 0.7). On the other hand, for LGG tumors, manual and automated segmentation were statistically comparable in terms of performance; however, only manual segmentation achieved "good" performance. This could be because LGGs can simply conform to normal anatomy (eg, expanding gyri), making them difficult to diagnose, especially when small. This further highlights the need for future work on improving machine learning performance to segment LGG more accurately to achieve comparable results to that of manual segmentation.
Reporting guidelines reinforce robust evaluation and generalizability of diagnostic models. The recent CLAIM checklist, developed on the foundations of earlier well-established guidelines, is the first to address AI applications in medical imaging. This is the first study to adopt this pertinent guideline for the comprehensive assessment of reporting quality for brain tumor identification. Although over 70% of studies detailed data sources, model design, and ground truth definitions, only a minority reported missing data handling, data partitioning, study participant flow, and external validation. This is consistent with Yusuf et al.'s systematic review 5 which found poor reporting of the study participant flow, the distribution of disease severity, and model validation techniques within ML-based diagnosis models. Such findings reiterate the necessity for studies to employ guidelines to aid their interpretation and reusability. This is paramount in ensuring reliable research is the basis of pioneering novel techniques into clinical practice.
The absence of external validation jeopardizes the generalizability of models for clinical use. Our study highlights such a limitation, with only 41.3% (n = 19/46) of segmentation and 2.6% (n = 1/38) of detection studies in the meta-analysis undertaking external validation.

[Figure: panels A and B; panel title "DL Detection methods (20 tables)"]
To address this, we performed a sensitivity analysis on segmentation models that were externally validated, which showed similar results to the original analysis. To ensure that future studies externally validate their machine learning algorithms, authors should utilize the CLAIM guideline when reporting their study. In addition,
journals should encourage authors to provide details about the elements of reporting outlined in CLAIM for editors and reviewers during the assessment of AI-related manuscripts in medical imaging. Secondly, high heterogeneity was observed, which may be due to methodological diversity in machine learning techniques. Thirdly, only a quarter of included studies were eligible for meta-analysis because of inadequate reporting, particularly of the uncertainty values of performance metrics, thus compromising data availability. This issue has been recognized by non-neuro-oncology systematic reviews. 96 Fourthly, most studies failed to report manual segmentation results, impeding a direct comparison of the techniques. To promote standardization of ground-truth images for training AI algorithms, experts should utilize structured reporting during manual segmentation. 100 Finally, most studies tested and trained their algorithms on open-access datasets. We propose that available automated algorithms be applied to prospective, routinely collected MRI data to assess performance and feasibility for use in daily clinical practice.
To conclude, we found promising results for the use of AI algorithms in brain tumor identification and highlight areas for future research. Further improvements to study design are needed, with adherence to reporting guidelines, which will enable transparent evaluation and generalizability of diagnostic AI models.